@bluecrayon52

Issue #, if available:

Description of changes:
Updated OFI NCCL plugin path to /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu
Updated CUDA toolkit path to /usr/local/cuda/lib64
Updated OFI NCCL tuner plugin path to /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


@amanshanbhag amanshanbhag left a comment


@nghtm

nghtm commented Oct 30, 2025

@bluecrayon52 can we close out on comments and merge?

@bluecrayon52

Apologies for the delay in getting back to this. Thankfully, it looks like @erezzarum had some helpful insights into how we can clean things up.

How are new builds of the public.ecr.aws/hpc-cloud/nccl-tests image coordinated with updates to nccl-tests.Dockerfile? I see that #881 was approved, and if it's ready to merge, I'd like to coordinate merging this PR with a new build of the public.ecr.aws/hpc-cloud/nccl-tests image to ensure all the changes are reflected.

I was able to confirm that the following defaults were set in /etc/ld.so.conf.d:

kubectl exec efa-debug-pod -- cat /etc/ld.so.conf.d/000_efa.conf
/opt/amazon/efa/lib

kubectl exec efa-debug-pod -- cat /etc/ld.so.conf.d/100_ofinccl.conf
/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/

kubectl exec efa-debug-pod -- cat /etc/ld.so.conf.d/nvidia.conf
/usr/local/cuda/lib64

However, with the current build of public.ecr.aws/hpc-cloud/nccl-tests I had to add /opt/nccl/build/lib to LD_LIBRARY_PATH; otherwise the job fails with the following error:

[1,11]<stderr>:/opt/nccl-tests/build/all_reduce_perf: error while loading shared libraries: libnccl.so.2: cannot open shared object file: No such file or directory
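
One way to see which directory on a search path, if any, actually holds libnccl.so.2 is sketched here. The helper name find_lib is hypothetical and not part of the PR. It only walks a colon-separated list such as LD_LIBRARY_PATH and does not consult the ldconfig cache or the loader's default directories.

```shell
#!/bin/sh
# find_lib NAME PATHLIST  (hypothetical helper, for illustration only)
# Print the first directory in the colon-separated PATHLIST that contains
# NAME and return 0; return 1 if none of the listed directories contains it.
find_lib() {
    name=$1
    old_ifs=$IFS
    IFS=:
    set -f                      # disable globbing while splitting the list
    # $2 is left unquoted on purpose so it is split on ':' via IFS.
    for dir in $2; do
        if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
            IFS=$old_ifs
            set +f
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    set +f
    return 1
}
```

Run inside the worker container, `find_lib libnccl.so.2 "$LD_LIBRARY_PATH"` would print the matching directory, or return non-zero when the library is not reachable through LD_LIBRARY_PATH alone.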

Likewise, I had to keep /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu in LD_LIBRARY_PATH; otherwise the tuner is not found and the job falls back to TCP:

[1,7]<stdout>:nccl-tests-worker-0:34:72 [7] NCCL INFO TUNER/Plugin: Could not find: ofi libnccl-tuner-ofi.so. Using internal tuner plugin.
[1,8]<stdout>:nccl-tests-worker-1:21:103 [0] NCCL INFO Channel 03/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
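
Both symptoms can be spotted mechanically in a saved NCCL_DEBUG=INFO log. The sketch below greps for the two message fragments quoted in the log lines above; the function name check_nccl_log and the wording of its output are hypothetical, and the patterns assume the messages look the same in other NCCL versions.

```shell
#!/bin/sh
# check_nccl_log FILE  (hypothetical helper, for illustration only)
# Scan an NCCL debug log for two symptoms:
#   - the tuner plugin could not be found
#   - at least one channel is routed "via NET/Socket"
# Prints one line per symptom found; returns 1 if any symptom is present.
check_nccl_log() {
    rc=0
    if grep -q 'TUNER/Plugin: Could not find' "$1"; then
        echo "tuner plugin not found"
        rc=1
    fi
    if grep -q 'via NET/Socket' "$1"; then
        echo "socket transport in use"
        rc=1
    fi
    return $rc
}
```

A typical use would be to redirect the launcher output to a file and call `check_nccl_log job.log` before trusting the bandwidth numbers.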

FI_EFA_FORK_SAFE=1, NCCL_BUFFSIZE=8388608, and NCCL_P2P_NET_CHUNKSIZE=524288 are all redundantly set to their respective defaults and can be removed.

The MPIJob example that @erezzarum provided in #881 is lean, and I'm inclined to follow the same pattern once a new build of the public.ecr.aws/hpc-cloud/nccl-tests image is available with the applied changes. I believe @iankouls-aws owns this process. I'll ping him for insights.

I would make the same changes to nccl-tests-gb200.yaml, but it uses a separate ARM image (public.ecr.aws/u1m6g1t5/nccl-tests-arm64:latest), and without access to p6e-gb200.36xlarge instances for testing I wouldn't be able to validate that it works.

@erezzarum

It seems the latest version of the nccl-tests container image does not yet reflect my changes.
Can you please build the container image yourself and test it? In my testing, no extra LD_LIBRARY_PATH entries were needed.

From the Dockerfile:

# For ofi-nccl set paths for both aarch64 and x86_64
ENV LD_LIBRARY_PATH=/opt/amazon/ofi-nccl/lib/aarch64-linux-gnu:/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

ENV PATH=/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:$PATH

###################################################
## Install NCCL
RUN git clone -b ${NCCL_VERSION} https://github.com/NVIDIA/nccl.git  /opt/nccl \
    && cd /opt/nccl \
    && make -j $(nproc) src.build CUDA_HOME=/usr/local/cuda \
    NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_100,code=sm_100" \
    && echo "/opt/nccl/build/lib" > /etc/ld.so.conf.d/000_nccl.conf \
    && ldconfig
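
Because the Dockerfile excerpt registers /opt/nccl/build/lib with ldconfig, a rebuilt image can be checked by looking for libnccl.so.2 in the loader cache rather than in LD_LIBRARY_PATH. The sketch below parses `ldconfig -p` output from stdin; the function name cache_has_lib is hypothetical, and it assumes the usual glibc listing format in which the library name is the first field of each entry.

```shell
#!/bin/sh
# cache_has_lib NAME  (hypothetical helper, for illustration only)
# Read `ldconfig -p` output on stdin and succeed if NAME is listed.
cache_has_lib() {
    # Entries look like:
    #   <tab>libfoo.so.1 (libc6,x86-64) => /some/dir/libfoo.so.1
    # so the library name is the first whitespace-separated field.
    awk -v name="$1" '$1 == name { found = 1 } END { exit found ? 0 : 1 }'
}
```

For a running pod, something like `kubectl exec efa-debug-pod -- ldconfig -p | cache_has_lib libnccl.so.2` should succeed only when the image's cache already resolves the library.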

This is an example of what I tested with:

...
...
        spec:
          containers:
          - image: <PRIVATE ECR URI OF IMAGE REDACTED>
            imagePullPolicy: IfNotPresent
            name: test-nccl-launcher
            command:
            - /opt/amazon/openmpi/bin/mpirun
            - --allow-run-as-root
            - --tag-output
            - -np
            - "16"
            - -N
            - "8"
            - --bind-to
            - none
            - -x
            - PATH
            - -x
            - LD_LIBRARY_PATH
            - -x
            - FI_PROVIDER=efa
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_TUNER_PLUGIN=ofi
            - --mca
            - pml
            - ^ucx
            - --mca
            - btl
            - tcp,self
            - --mca
            - btl_tcp_if_exclude
            - lo,docker0,veth_def_agent
            - /opt/nccl-tests/build/all_reduce_perf
            - -b
            - "8"
            - -e
            - "10G"
            - -f
            - "2"
            - -c
            - "1"
            - -g
            - "1"
            - -n
            - "100"
            - -w
            - "10"
...
...
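
For a sense of how many message sizes the `-b 8 -e 10G -f 2` arguments cover, the sweep can be enumerated with a short loop. This is a sketch under two assumptions about nccl-tests that the thread does not state: that `10G` means 10 * 1024^3 bytes, and that the size is multiplied by the `-f` factor until it exceeds the end size. The function name sweep_sizes is hypothetical.

```shell
#!/bin/sh
# sweep_sizes MIN MAX FACTOR  (hypothetical helper, for illustration only)
# Print every size from MIN up to MAX (inclusive), multiplying by FACTOR
# each step. Returns 1 without printing if FACTOR is not greater than 1,
# since the loop would never terminate.
sweep_sizes() {
    [ "$3" -gt 1 ] || return 1
    size=$1
    while [ "$size" -le "$2" ]; do
        echo "$size"
        size=$((size * $3))
    done
}
```

Under those assumptions, `sweep_sizes 8 $((10 * 1024 * 1024 * 1024)) 2 | wc -l` gives the number of message sizes reported per run.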
